Abstract

For many tasks and data types, there are natural transformations to which the
data should be invariant or insensitive. For instance, in visual recognition, the
recognition of natural images should be insensitive to rotation and translation. This
requirement and its implications have been important in many machine learning
applications, and tolerance to image transformations has primarily been achieved by
using robust feature vectors. In this paper we propose a novel and computationally
efficient way to learn a local Mahalanobis metric per datum, and show how we can
learn a local metric that is invariant to any given transformation in order to
improve performance.

Metric learning is a machine learning task which learns a distance metric d(x, y) 
between data points, based on data instances. As distances play an important role in 
many machine learning algorithms, e.g. k-Nearest Neighbor and k-Means clustering, 
finding an appropriate metric for the task can improve performance considerably. This 
approach has been applied successfully to many problems such as face identification 
[?], image retrieval [?, 20], ranking [?] and clustering [22], to name just a few.

A standard approach to metric learning is to learn a global Mahalanobis metric

$$d(x,y)^2_M = (x-y)^T M (x-y) \tag{1}$$

where $M$ is a positive semi-definite (PSD) matrix. The PSD constraint only assures
this is a pseudometric, but for simplicity we will not make this distinction. Various
algorithms [?] differ by the objective through which they learn the matrix $M$
from the data. As $M$ is a PSD matrix, it can be written as $M = L^T L$ and therefore

$$d(x,y)^2_M = (x-y)^T M (x-y) = \|\tilde{x} - \tilde{y}\|_2^2, \qquad \tilde{x} = Lx,\ \tilde{y} = Ly. \tag{2}$$

This means that finding an optimal Mahalanobis distance is equivalent to finding the
optimal linear transformation of the data, and then using the $L_2$ distance on the
transformed data. This approach has two limitations. First, it is limited to linear
transformations. Second, it requires a large amount of labeled data.
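
The equivalence between a Mahalanobis distance and the $L_2$ distance after a linear
map can be checked numerically. The following is a minimal sketch; the matrix `L`, the
points `x` and `y`, and all helper names are illustrative assumptions, not taken from
the paper.

```python
def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(d * m for d, m in zip(diff, matvec(M, diff)))

def euclidean_sq(x, y):
    """Squared L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Hypothetical linear map and points, chosen only for illustration.
L = [[2.0, 1.0], [0.0, 3.0]]
M = matmul(transpose(L), L)      # M = L^T L is PSD by construction
x, y = [1.0, 2.0], [4.0, -1.0]

d_mahal = mahalanobis_sq(x, y, M)                  # distance under M
d_l2 = euclidean_sq(matvec(L, x), matvec(L, y))    # L2 distance after mapping by L
```

Both quantities evaluate to the same number, which is why learning $M$ and learning
the linear map $L$ are interchangeable.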

One approach that can be used to overcome the first limitation is to use local
distances [?], where we learn a unique distance function per training datum. Local
approaches do not produce, in general, a global metric (as they are usually not
symmetrical) but are commonly considered metric learning nonetheless. These methods, in
general, need similar and dissimilar training data for each local metric.

In our current work we use a local approach inspired by the work on exemplar-SVM
[18], which showed that using only negative examples can suffice for good performance.
The intuition behind this is that objects of the same class do not necessarily have to
be similar, but objects from different classes must be dissimilar. We will show how to
learn a local Mahalanobis distance that for each datum tries to keep the non-class as
far away as possible. This approach can use a large amount of weakly supervised data,
as in many cases negative examples are easier to acquire than positive examples. For
example, if we are interested in face identification, we can learn a local metric around
a query face image given a bank of training face images, which we only assume do not
belong to the queried person. Unlike other metric learning methods, we will not need
any labels on which image belongs to which person in the negative set.

The intuition why Mahalanobis distances are the natural model for local metrics
is simple. Assume we have some metric $d(x, y)$ on the dataset and assume that it
is smooth (at least twice continuously differentiable). From the metric properties we
know that if we fix $x$ and look at $f(y) = d(y, x)$ then $f$ has a global minimum at
$y = x$. Applying a second order Taylor approximation to $f$ around $x$ we get

$$d(y,x) = f(y) \approx f(x) + (y-x)^T \nabla f(x) + \tfrac{1}{2}(y-x)^T \nabla^2 f(x)(y-x) = \tfrac{1}{2}(y-x)^T \nabla^2 f(x)(y-x).$$

The equality holds since $x$ is the global minimum with value $f(x) = d(x, x) = 0$, so
the gradient vanishes there, and this also implies that $\nabla^2 f(x)$ is positive
semidefinite. While the Taylor approximation only holds for values of $y$ close to $x$,
metric methods such as k-NN focus on similar objects, so the approximation should be
good at the points of interest. This observation leads us to look for local matrices
that are of the form of a Mahalanobis distance.
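
To make the Taylor argument concrete, here is a small numerical sketch. The function
`f` below is a hypothetical smooth dissimilarity with its minimum at `y == x`, chosen
only for illustration, and its Hessian `H` is worked out by hand. The quadratic form
$\tfrac{1}{2}(y-x)^T H (y-x)$ tracks `f` closely near `x` and poorly far from it.

```python
import math

def f(y, x):
    """Hypothetical smooth dissimilarity with minimum value 0 at y == x."""
    return (1.0 - math.cos(y[0] - x[0])) + (y[1] - x[1]) ** 2

# Hessian of f with respect to y at y == x, computed by hand: diag(1, 2).
H = [[1.0, 0.0], [0.0, 2.0]]

def local_quadratic(y, x):
    """Second-order Taylor term 0.5 * (y - x)^T H (y - x)."""
    d = [a - b for a, b in zip(y, x)]
    Hd = [sum(h * v for h, v in zip(row, d)) for row in H]
    return 0.5 * sum(a * b for a, b in zip(d, Hd))

x = [0.3, -1.2]
near = [x[0] + 0.01, x[1] + 0.02]   # a point close to x
far = [x[0] + 2.0, x[1] + 0.02]     # a point far from x

err_near = abs(f(near, x) - local_quadratic(near, x))  # tiny approximation error
err_far = abs(f(far, x) - local_quadratic(far, x))     # large approximation error
```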

We will first define our local Mahalanobis distance learning method as a semidefinite
programming problem. We will then show how this problem can be solved efficiently
without any costly matrix decompositions. This allows us to solve high dimensional
problems that regular semidefinite solvers cannot handle.

The second major contribution of this paper will be to show how invariant local
matrices can be learned. In many cases we know there are simple transformations that
our metric should not be sensitive to, for example small translations and rotations of
natural images. We know a priori that if $x' = T(x)$, where $T$ is the said
transformation, then $d(x, x') \approx 0$. We will show how this prior knowledge about
our data can be incorporated by learning a local invariant metric. This can also be
done in an efficient manner, and we will show that it improves performance in our
experiments.

1 Related work 

Metric learning is an active research field with many algorithms, generally divided into
linear [?], which learn a Mahalanobis distance, non-linear [14], which learn a nonlinear
transformation and use the $L_2$ distance on the transformed space, and local, which
learn a metric per datum. The LMNN and MLMNN [?] algorithms are considered the leading
metric learning methods. For a recent comprehensive survey that covers linear,
non-linear and local methods see [?].

The exemplar-SVM algorithm [18] can be seen as a local similarity measure. It
is obtained by maximizing margins with a linear model and, like our work, is weakly
supervised. Unlike exemplar-SVM, we learn a Mahalanobis matrix and can learn an
invariant metric. Another related work is PMLM [20], which also finds a local
Mahalanobis metric for each data point. However, this method uses global constraints,
and therefore cannot work with weakly supervised data, i.e. a single positive example.
None of the techniques above learns local invariant metrics.

The most common way to achieve invariance, or at least insensitivity, to a
transformation in computer vision applications is by using hand-crafted descriptors
such as SIFT [11] or HOG [8]. Another way, used in convolutional networks [15], is by
adding pooling and subsampling, forcing the net to be insensitive to small translations.
It is important to note that transformations such as rotations have a global behaviour,
i.e. there is a global consistency between the pixel movements. This global consistency
is not totally captured by the pooling and subsampling. As we will see in our
experiments, using an invariant metric can be useful even when working with robust
features such as HOG.


2 Local Mahalanobis 

In this section we will show how a local Mahalanobis distance with maximal margin 
can be learned in a fast and simple way. 

We will assume that we are given a single query image $x_0$ that belongs to some
class, e.g. a face of a person. We will also be given a set of negative data
$x_1, \dots, x_n$ that do not belong to that class, e.g. a set of face images of various
other people. We will learn a local Mahalanobis metric for $x_0$, $M(x_0) \succeq 0$,
where $M \succeq 0$ means $M$ is positive semi-definite. For matrices $M, N$, we will
denote by $\|M\|$ the Frobenius norm, $\|M\|^2 = \sum_{i,j} M_{ij}^2$, and by
$\langle M, N \rangle$ the standard inner product
$\langle M, N \rangle = \sum_{i,j} M_{ij} N_{ij}$.

We wish to find a Mahalanobis matrix $M$ given the positive datum $x_0$ and the
negative data $x_1, \dots, x_n$. Large margin methods have been very successful in
metric learning [?], and more generally in machine learning. Therefore, our algorithm
will look for the PSD matrix $M$ that maximizes the distance to the closest negative
example

$$M = \arg\max_{M \succeq 0} \Big( \min_{1 \le i \le n} (x_i - x_0)^T M (x_i - x_0) \Big) \tag{3}$$

This optimization problem has no solution as stated because it is unbounded:
multiplying $M$ by a positive scalar multiplies the minimum distance by the same
scalar. This can be fixed by normalizing $M$ to have $\|M\| = 1$. As is normally done
with margin methods, we can minimize the norm under a fixed margin constraint instead
of maximizing the margin under a fixed norm constraint. The resulting objective is

$$\begin{aligned}
M(x_0) = \arg\min_{M}\ & \tfrac{1}{2}\|M\|^2 \\
\text{subject to: } & (x_i - x_0)^T M (x_i - x_0) \ge 2 \quad \forall i \in \{1, \dots, n\} \\
& M \succeq 0
\end{aligned} \tag{4}$$

where the constant 2 is arbitrary and will be convenient later on. While this is a
convex semidefinite programming task, it is very slow for data of reasonable dimension
(in the thousands), even for state of the art solvers. This is because PSD solvers
apply a projection onto the semidefinite cone, performing an expensive singular value
or eigenvalue decomposition at each iteration.

To solve this optimization in a fast manner we will first relax the PSD constraint
and look at the following objective

$$\begin{aligned}
M(x_0) = \arg\min_{M}\ & \tfrac{1}{2}\|M\|^2 \\
\text{subject to: } & (x_i - x_0)^T M (x_i - x_0) \ge 2 \quad \forall i \in \{1, \dots, n\}
\end{aligned} \tag{5}$$

We will then see how this is equivalent to a kernel SVM problem with a quadratic
kernel, and therefore can be solved easily with off-the-shelf SVM solvers such as
LIBSVM [?]. Finally we will show how the solution of objective (5) is in fact the
solution of objective (4), resulting in a fast solution to objective (4) without any
matrix decomposition.

Theorem 1. The solution of objective (5) is given by running kernel SVM with kernel
$k(x, y) = \langle x, y \rangle^2$ on inputs $\bar{x}_0, \bar{x}_1, \dots, \bar{x}_n$,
where $\bar{x}_i = x_i - x_0$.

Proof. Define $\varphi(x) = x x^T$, a function that maps a column vector to a matrix.
This function has the following simple properties:

• $k(x, y) = \langle x, y \rangle^2 = \langle \varphi(x), \varphi(y) \rangle$, i.e. the
function $\varphi$ is the mapping associated with the quadratic kernel.

• For any matrix $W$, we have $\langle W, \varphi(x) \rangle = x^T W x$.

Both can be easily verified using $\varphi(x)_{ij} = x_i x_j$. We can define auxiliary
labels $y_0 = -1$ and $y_i = 1$ for $1 \le i \le n$. Combining everything, objective (5)
can be rewritten as

$$\begin{aligned}
M(x_0) = \arg\min_{M}\ & \tfrac{1}{2}\|M\|^2 \\
\text{subject to: } & y_i \left( \langle M, \varphi(\bar{x}_i) \rangle - 1 \right) \ge 1 \quad \forall i \in \{0, 1, \dots, n\}
\end{aligned} \tag{6}$$

where we can include $i = 0$ because $\bar{x}_0 = 0$, so
$\langle M, \varphi(\bar{x}_0) \rangle = 0$ for any matrix $M$.

Objective (6) is exactly an SVM problem with quadratic kernel, with bias fixed to
one, given inputs $\bar{x}_0, \dots, \bar{x}_n$, proving the theorem. Notice also that
for the identity matrix $M = I$ we have $\langle M, \varphi(\bar{x}_i) \rangle > 0$ for
$i \ge 1$ and $\langle M, \varphi(\bar{x}_i) \rangle = 0$ for $i = 0$, therefore the
data is separable by $M = I$ and the optimization is feasible. □
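
The two properties of the map $\varphi(x) = x x^T$ used in the proof can be verified
numerically. The sketch below uses hypothetical vectors and a hypothetical matrix `W`;
the helper names are illustrative only.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def phi(x):
    """Outer product x x^T as a list of rows."""
    return [[a * b for b in x] for a in x]

def frob_inner(A, B):
    """Standard matrix inner product: sum over i, j of A_ij * B_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def quad_form(W, x):
    """Quadratic form x^T W x."""
    return dot(x, [dot(row, x) for row in W])

# Hypothetical inputs; W does not need to be symmetric for the identity to hold.
x = [1.0, -2.0, 0.5]
y = [3.0, 0.0, 4.0]
W = [[2.0, 1.0, 0.0], [0.0, -1.0, 3.0], [5.0, 0.0, 1.0]]

kernel_value = dot(x, y) ** 2                # k(x, y) = <x, y>^2
feature_value = frob_inner(phi(x), phi(y))   # <phi(x), phi(y)>
lhs = frob_inner(W, phi(x))                  # <W, phi(x)>
rhs = quad_form(W, x)                        # x^T W x
```

Here `kernel_value` equals `feature_value`, and `lhs` equals `rhs`, matching the two
bullet points in the proof.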

Now that we have shown how objective (5) can be converted into a standard SVM
form, for which efficient solvers exist, we will show that its solution is also the
solution to objective (4).

Theorem 2. The solution to objective (5) is the solution to objective (4).

Proof. To prove the theorem it suffices to show that the solution is indeed positive
semidefinite. A well known observation arising from the dual formulation of the SVM
objective [?] is that the optimal solution $M$ has the form

$$M = \sum_{i=0}^{n} \alpha_i y_i \varphi(\bar{x}_i), \qquad \alpha_i \ge 0. \tag{7}$$

Since $\varphi(x) \succeq 0$ for any $x$, as its only nonzero eigenvalue is $\|x\|^2$,
$\varphi(\bar{x}_0) = \varphi(0) = 0$, and $y_i = 1$ for $i \ge 1$, we get

$$M = \sum_{i=0}^{n} \alpha_i y_i \varphi(\bar{x}_i) = \sum_{i=1}^{n} \alpha_i \varphi(\bar{x}_i) \succeq 0, \tag{8}$$

where the positive semidefiniteness in eq. (8) is assured due to the set of PSD matrices
being a convex cone. □

Combining theorem 1 with theorem 2 we get that in order to solve objective (4) it
is enough to run an SVM solver with a quadratic kernel function, thus avoiding any
matrix decomposition.
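
To illustrate why no matrix decomposition is needed, the following is a minimal sketch
that solves the relaxed objective through its dual. It is not the solver used in the
paper: a plain dual coordinate ascent is swapped in for an off-the-shelf SVM package so
that the example stays self-contained. Because the bias is fixed and
$\varphi(\bar{x}_0) = 0$, the dual reduces to maximizing
$2\sum_i \alpha_i - \tfrac{1}{2}\alpha^T K \alpha$ over $\alpha \ge 0$ with
$K_{ij} = \langle \bar{x}_i, \bar{x}_j \rangle^2$. The function names and the toy data
are assumptions made for illustration.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def solve_local_mahalanobis(x0, negatives, sweeps=200):
    """Dual coordinate ascent for min 0.5*||M||^2 s.t. z_i^T M z_i >= 2.

    Returns (z, alpha) with M = sum_i alpha[i] * z[i] z[i]^T and alpha[i] >= 0.
    Assumes every negative differs from x0, so that K[i][i] > 0.
    """
    z = [[a - b for a, b in zip(xi, x0)] for xi in negatives]
    n = len(z)
    K = [[dot(zi, zj) ** 2 for zj in z] for zi in z]   # quadratic kernel matrix
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Current value of z_i^T M z_i, expressed through the kernel.
            margin_i = sum(K[i][j] * alpha[j] for j in range(n))
            # Exact maximization over alpha[i], clipped to keep alpha[i] >= 0.
            alpha[i] = max(0.0, alpha[i] + (2.0 - margin_i) / K[i][i])
    return z, alpha

def local_distance(q, x0, z, alpha):
    """(q - x0)^T M (q - x0), evaluated without ever forming M."""
    d = [a - b for a, b in zip(q, x0)]
    return sum(a * dot(zi, d) ** 2 for a, zi in zip(alpha, z))

# Toy data: a query at the origin and three negatives.
x0 = [0.0, 0.0]
negatives = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
z, alpha = solve_local_mahalanobis(x0, negatives)
margins = [local_distance(xi, x0, z, alpha) for xi in negatives]
```

On this toy input the coefficients are non-negative, so the implied matrix is PSD
without any projection step, and every negative ends up at squared distance at least 2
from the query.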

Looking at this as an SVM problem has further benefits. SVM solvers do not
compute $M$ directly, but return the set of support vectors
$\bar{x}_{i_1}, \dots, \bar{x}_{i_k}$ and coefficients $\alpha_{i_1}, \dots, \alpha_{i_k}$
such that $M = \sum_{j=1}^{k} \alpha_{i_j} \varphi(\bar{x}_{i_j})$. This allows us to
work in high dimension $d$, where the $O(d^2)$ memory needed to store the matrix can be
a problem and can slow computations further. As the rank of the matrix is bounded by
the number of support vectors, one can see that in many applications we get a
relatively low rank matrix. This bound on the rank can be improved by using sparse-SVM
algorithms [7]. In practice we got low rank matrices without resorting to sparse SVM
solvers.
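
The memory argument can be illustrated with a short sketch that evaluates the same
quadratic form in two ways: once by building the full $d \times d$ matrix, and once
using only inner products with the support vectors. The support vectors, coefficients
and function names below are hypothetical values chosen for illustration.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def distance_explicit(diff, support, coeffs):
    """Build M = sum_k a_k s_k s_k^T (O(d^2) memory), then return diff^T M diff."""
    dim = len(diff)
    M = [[0.0] * dim for _ in range(dim)]
    for a, s in zip(coeffs, support):
        for r in range(dim):
            for c in range(dim):
                M[r][c] += a * s[r] * s[c]
    return dot(diff, [dot(row, diff) for row in M])

def distance_low_rank(diff, support, coeffs):
    """Same value from inner products only: sum_k a_k <s_k, diff>^2 (O(k*d) work)."""
    return sum(a * dot(s, diff) ** 2 for a, s in zip(coeffs, support))

# Hypothetical support vectors and coefficients, for illustration only.
support = [[1.0, 0.0, 2.0, -1.0], [0.5, 3.0, 0.0, 1.0]]
coeffs = [0.7, 0.2]
diff = [1.0, -1.0, 2.0, 0.5]      # stands for a difference x - x_0

full = distance_explicit(diff, support, coeffs)
cheap = distance_low_rank(diff, support, coeffs)
```

Both routes give the same distance, but the second never allocates the matrix, which
matters when $d$ is in the thousands.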

3 Local Invariant Mahalanobis 

For some applications, we know a priori that certain transformations should have a
small effect on the metric. We will show how to include this knowledge in the local
metric we learn, learning locally invariant metrics. In section 4 we will see this has a
major effect on performance.

Assume we know a set of functions $T_1, \dots, T_k$ that the desired metric should be
insensitive to, i.e. $d(x, T_i(x))$ should be small for all $x$ and $i$. A canonical
example is small rotations and translations of natural images. One of the major issues
in computer vision arises from the instability of the pixel representation under these
transformations. Various descriptors such as SIFT [11] and HOG [8] offer a more robust
representation, and have been highly successful in many computer vision applications.
We will show in section 4 that even when using a relatively robust representation such
as HOG, learning an invariant metric has a significant impact.


A natural way to mathematically formulate the idea of being insensitive to a
transformation is to require the leading term of the approximation to vanish in that
direction. In our case this means

$$(T(x_0) - x_0)^T\, \nabla^2 d(x_0, y)\, (T(x_0) - x_0) = 0. \tag{9}$$

If we return to our basic intuition of the local Mahalanobis matrix as the Hessian
matrix $\nabla^2 d(x_0, y)$, we can now state the new local invariant Mahalanobis
objective

$$\begin{aligned}
M(x_0) = \arg\min_{M}\ & \tfrac{1}{2}\|M\|^2 \\
\text{subject to: } & (x_i - x_0)^T M (x_i - x_0) \ge 2 \quad \forall i \in \{1, \dots, n\} \\
& (T_j(x_0) - x_0)^T M (T_j(x_0) - x_0) = 0 \quad \forall j \in \{1, \dots, k\} \\
& M \succeq 0
\end{aligned} \tag{10}$$


We will show how, by applying a small transformation to the data, we can reduce
this to objective (4), which we can solve easily.

Theorem 3. Define $V = \mathrm{span}\{T_1(x_0) - x_0, \dots, T_k(x_0) - x_0\}$. Then the
minimizer of objective (5) with $x_i - x_0$ replaced by $z_i$, its projection onto
$V^\perp$, is the minimizer of objective (10).

Proof. For PSD matrices, the constraint $(T_j(x_0) - x_0)^T M (T_j(x_0) - x_0) = 0$ is
equivalent to $M(T_j(x_0) - x_0) = 0$. This can be seen if we write the vector in the
basis of the eigenvectors of $M$ and notice that components with positive eigenvalues
have a positive contribution to the quadratic form. This means that $Mv = 0$ for all
$v \in V$. Each vector $x_i - x_0$ can be split into two orthogonal components,
$x_i - x_0 = z_i + v_i$, where $v_i$ is its projection onto $V$ and $z_i$ is its
projection onto $V^\perp$. Our equality constraints
$(T_j(x_0) - x_0)^T M (T_j(x_0) - x_0) = 0$ now imply

$$(x_i - x_0)^T M (x_i - x_0) = (z_i + v_i)^T M (z_i + v_i) = z_i^T M z_i \tag{11}$$

since all the other terms vanish. We can now rewrite objective (10) as

$$\begin{aligned}
M(x_0) = \arg\min_{M}\ & \tfrac{1}{2}\|M\|^2 \\
\text{subject to: } & z_i^T M z_i \ge 2 \quad \forall i \in \{1, \dots, n\} \\
& Mv = 0 \quad \forall v \in V \\
& M \succeq 0
\end{aligned} \tag{12}$$

If we forget the equality constraints we get objective (4) with $x_i - x_0$ replaced by
$z_i$. To finish the proof we need to show that the solution to the optimization
without the equality constraints does indeed satisfy them.


As we have already seen in the proof of theorem 2, the optimal solution is of the
form $M = \sum_i \alpha_i \varphi(z_i) = \sum_i \alpha_i z_i z_i^T$. The vector $z_i$
is a member of $V^\perp$, so for $v \in V$

$$Mv = \sum_i \alpha_i z_i \left( z_i^T v \right) = 0, \tag{13}$$

proving that the solution satisfies the equality constraints. □


A few comments are worth noting about this formulation. First, the problem may
not be linearly separable, although in our experiments with real data we did not
encounter any non-separable case. This can be easily solved, if needed, by the standard
method of adding slack variables. Second, the algorithm just adds a simple
preprocessing step to the previous algorithm and runs in approximately the same time.
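
The preprocessing step mentioned above, projecting each $x_i - x_0$ onto $V^\perp$,
can be sketched as follows. This is an illustrative implementation using Gram-Schmidt
orthogonalization; the paper does not specify how the projection is computed, and the
data and function names here are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of span(vectors), dropping dependent ones."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def project_out(u, basis):
    """Component of u orthogonal to span(basis), i.e. its projection onto V-perp."""
    z = list(u)
    for b in basis:
        c = dot(z, b)
        z = [zi - c * bi for zi, bi in zip(z, b)]
    return z

# Hypothetical data: a query x0, one transformed copy T_1(x0), and two negatives.
x0 = [1.0, 1.0, 0.0]
transformed = [[1.0, 1.0, 0.5]]        # T_1(x0), so V is spanned by (0, 0, 0.5)
negatives = [[3.0, 1.0, 2.0], [1.0, 4.0, -1.0]]

V = orthonormal_basis([[t - a for t, a in zip(tx, x0)] for tx in transformed])
z = [project_out([n - a for n, a in zip(xi, x0)], V) for xi in negatives]
# Every z_i is orthogonal to V, so any M = sum_i alpha_i z_i z_i^T satisfies M v = 0.
overlaps = [dot(zi, v) for zi in z for v in V]
```

The vectors `z` can then be passed to the same quadratic-kernel SVM step as before in
place of $x_i - x_0$.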

4 Experiments 

4.1 Running time 

We compared running the optimization with an SVM solver [?] to solving it as a
semidefinite problem and as a quadratic problem (relaxing the semidefinite constraint).
The main limitation when running off-the-shelf solvers is memory. Quadratic or
semidefinite solvers need a constraint matrix, which in our case is a full matrix of
size $n \times d^2$, where $n$ is the number of samples and $d$ is the data dimension.
We tested all three approaches on the MNIST dataset of dimension 784 using only 5000
negative examples, as this already resulted in a matrix of size 24.6 GB.
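
The quoted matrix size can be reproduced with a short calculation. The sketch assumes
8-byte double-precision entries and decimal gigabytes, neither of which is stated in
the text.

```python
n = 5000              # negative examples used in the timing experiment
d = 28 * 28           # MNIST dimension, 784
bytes_per_entry = 8   # assumption: double-precision floats

entries = n * d ** 2                        # size of the n x d^2 constraint matrix
size_gb = entries * bytes_per_entry / 1e9   # decimal gigabytes
print(f"{entries:,} entries, {size_gb:.1f} GB")   # 3,073,280,000 entries, 24.6 GB
```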

Currently first order methods, such as ADMM [?], are the leading approaches to
solving problems such as quadratic and semidefinite programming for large matrices.
We used YALMIP for modeling and solved using SCS [?]. The time to run this as a
semidefinite program was 1152 ± 417 sec. The time it took to run this as a quadratic
program was 545 ± 74 sec. In comparison, when we ran this as an SVM problem it took
at most 0.36 sec. We excluded the time needed to build the $n \times d^2$ constraint
matrix for the quadratic and semidefinite solvers.

This improvement of roughly three orders of magnitude should not be a surprise. It is
well known that while SVM can be solved as a quadratic program, generic quadratic
solvers perform much slower than solvers designed specifically for SVM.

4.2 MNIST 

The MNIST dataset is a well known digit recognition dataset, comprising $28 \times 28$
grayscale images on which we perform deskewing preprocessing. For each of the
60,000 training images we computed a local Mahalanobis distance and a local invariant
Mahalanobis distance (using only negative examples). At test time we performed k-NN
classification with k = 3 using the local metrics. We show some examples of nearest
neighbours in Figure 1. We compared this with exemplar-SVM, as the leading technique
most similar to ours. We also compared our scores to exemplar-SVM where we
add the transformed images as positive training data. To show the importance of the
invariance objective, we also compare to an SVM with quadratic kernel to which we add
the transformed data as positive training data (unlike the way we use the shifted data).
Finally, we compared our results to the state-of-the-art metric learning method LMNN
(linear metric), and to MLMNN, a local version of LMNN, which learns multiple metrics
(but not one per datum).

Table 1: Classification error for MNIST dataset.

Method                    Error
eSVM                      1.75%
eSVM+shifts               1.59%
Local Mahal               1.69%
quadSVM+shifts            1.50%
inv-Mahal (our method)    1.26%
LMNN                      1.69%
MLMNN                     1.18%

As can be seen in Table 1, we perform much better than exemplar-SVM and are
comparable with MLMNN. It is important to note that unlike MLMNN, we compare each
datum only to negatives, so our method is applicable in scenarios where MLMNN is
not.
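
The test-time procedure of this section, k-NN with k = 3 over per-datum metrics, can
be sketched as below. The text does not say how distances produced by different local
metrics are compared, so this sketch assumes the raw values are ranked directly. The
`models` structure, its field names and the toy values are all hypothetical.

```python
from collections import Counter

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def local_distance(q, model):
    """(q - x_i)^T M_i (q - x_i) with M_i = sum_k a_k z_k z_k^T."""
    diff = [a - b for a, b in zip(q, model["x"])]
    return sum(a * dot(zk, diff) ** 2 for a, zk in zip(model["alpha"], model["z"]))

def knn_predict(q, models, k=3):
    """Majority vote among the k training points whose own metric puts q closest."""
    ranked = sorted(models, key=lambda m: local_distance(q, m))
    votes = Counter(m["label"] for m in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical per-datum models; in practice z and alpha would come from the solver.
axes = [[1.0, 0.0], [0.0, 1.0]]
models = [
    {"x": [0.0, 0.0], "label": "a", "z": axes, "alpha": [2.0, 2.0]},
    {"x": [0.2, 0.1], "label": "a", "z": axes, "alpha": [2.0, 2.0]},
    {"x": [3.0, 3.0], "label": "b", "z": axes, "alpha": [2.0, 2.0]},
    {"x": [3.1, 2.9], "label": "b", "z": axes, "alpha": [2.0, 2.0]},
]
prediction = knn_predict([0.1, 0.3], models)
```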

Figure 1: Nearest neighbour for various matrices: (a) original image, (b) $L_2$ distance,
(c) exemplar-SVM, (d) local Mahalanobis, (e) local invariant Mahalanobis.

Another key observation is the difference between the invariant-Mahalanobis and 
the quadratic-SVM with shifts. While very similar functionally, we see that looking at 
the problem as a local Mahalanobis matrix gives important intuition, i.e. the way to
use the shifted images, that leads to better performance. 

4.3 Labeled Faces in the Wild (LFW)

LFW is a challenging dataset containing 13,233 face images of 5749 different
individuals with a high level of variability. The LFW dataset is divided into 10
subsets, where the task is to classify 600 pairs of images from one subset as
same/not-same using the other 9 subsets as training data. We perform the unsupervised
LFW task, where we do not use any labelling inside the training images we get, besides
the fact that they are different from both test images.

We used the aligned images [13] and represented them using HOG features [8]. We
compared our results to a cosine similarity baseline, to exemplar-SVM and to
exemplar-SVM with shifts. We note that we cannot use LMNN or MLMNN on this data, as we
only have negative images with a single positive image.


Table 2: Classification error for LFW dataset.

Method                    Error
Cosine similarity         30.57 ± 1.4%
eSVM                      26.90 ± 2.2%
eSVM+shifts               27.12 ± 2.3%
Local Mahal               19.85 ± 1.3%
inv-Mahal (our method)    19.48 ± 1.5%


As we can see from Table 2, the local Mahalanobis greatly outperforms the
exemplar-SVM. We also see that even when using robust features such as HOG, learning an
invariant metric improves performance, albeit to a lesser degree.


5 Summary 

We showed an efficient way to learn a local Mahalanobis metric given a query datum
and a set of negative data points. We have also shown how to incorporate prior
knowledge about our data, in particular the transformations to which it should be
robust, and use it to learn locally invariant metrics. We have shown that our methods
are competitive with leading methods while being applicable to other scenarios where
methods such as LMNN and MLMNN cannot be used.